An Introduction to LOTlib

by Eric Bigelow

LOTlib is a Python 2.x library that provides a generic framework for representing and reasoning over "language of thought" (LOT) models. LOTlib can be used to model compositional concept learning across a wide range of domains. This tutorial is intended to answer two questions: what basic concepts should we be familiar with, and how do we implement a grammar in LOTlib to model an arbitrary domain?

An Arbitrary Domain (the Number Game)

Let's say we're curious about how quickly a learner (a child, perhaps) could detect a compositional pattern given a list of input data. In the number game, the challenge is to infer which mathematical expression most likely generated the input data.

For example, $\{1, 3, 5, 7, 9, 11\}$ probably corresponds to the odd integers between 1 and 11. It could also be the odd numbers from 1 to 19, or from -577 to 73, but these are less likely. A learner acquiring a concept like odd numbers will get lots of data with some amount of noise - once in a while they will mishear someone or misunderstand a reference, and associate odd with an even number.


*Belief state after seeing (a) 1 example, and (b) 4 examples*

We can compose our mathematical primitives into expression trees that correspond to specific hypotheses. For example, the tree $map(\lambda n .\ 2n - 1,\ range(1,100))$ corresponds to the hypothesis 'odd numbers'.

Suppose I tell you I am thinking of some simple arithmetical concept $C$. I give you a series of randomly chosen positive examples $X = \{x_{1},...,x_{N}\}$, and ask you whether other test cases $y$ belong to the extension of $C$. Suppose, for simplicity, that our domain includes only numbers between 1 and 100, so the task is to compute $p(y \in C \mid X)$ for $y \in \{1, . . . , 100\}$.

A sub-task of this is to find the hypothesis $h$ that most likely models the concept $C$ that I was thinking of:

$$\underset{h}{argmax}\ p(h \mid X)$$

However, this doesn't immediately tell us whether $y \in C$. What we want to do is estimate the posterior probability $p(h \mid X)$ of each hypothesis we can think of, and then sum these posteriors over the set $H_y$ of hypotheses whose extensions contain $y$ (that is, $\forall h \in H_y \,.\, y \in h$):

$$p(y \in C \mid X) = \underset{h \in H_y}{\sum} p(h \mid X)$$
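As a toy illustration (plain Python with made-up hypotheses and posterior values, not anything computed by LOTlib), the generalization probability is just the posterior mass of the hypotheses whose extensions contain $y$:

# Toy sketch: p(y in C | X) as a sum of posterior mass over hypotheses containing y.
# The hypotheses and posterior values below are invented for illustration.
posteriors = {'powers_of_two': 0.7, 'even_numbers': 0.2, 'all_numbers': 0.1}
extensions = {'powers_of_two': set([2, 4, 8, 16, 32, 64]),
              'even_numbers':  set(range(2, 101, 2)),
              'all_numbers':   set(range(1, 101))}

def p_y_in_C(y):
    return sum(p for h, p in posteriors.items() if y in extensions[h])

print p_y_in_C(64)  # in all three extensions -> 1.0
print p_y_in_C(10)  # even numbers and all numbers only -> 0.3
print p_y_in_C(7)   # only 'all_numbers' -> 0.1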

For example, if our input is $X = \{8, 16, 2, 8, 4, 2, 16, 2, 8, 4\}$, we might generate the hypothesis $h$ that this data was produced by the expression $f(n) = 2^{n}$.

In the number game, we chose a domain of integers. For this tutorial, our domain will be $D = \{ n \in \mathbb{N} \mid 1 \leq n \leq 100 \}$. So, for example, the even numbers are the set $\{m \in D \mid \exists n \in D . (m = 2n)\}$.

The Basics

Grammars & Primitives

LOTlib is fundamentally based around the concept of abstract grammars - sets of rules. When an individual attempts to learn patterns from input data, we can model this individual as having an internal grammar of primitives. In the number game, input data consists of a list of integers.

Learner primitives include the integers themselves 1, 2, 3, ..., as well as arithmetic operations such as +, -, *, ^. Using these primitives, a grammar can compose arbitrarily large trees. For example, we could build the number $1000$ with only the primitives 1 and + (adding 1 a thousand times):

$$1000 = (1 + (1 + (1 + (1 + (1 + (. . .))))))$$

In this example, we will consider rules as mapping either to a set of integers (e.g. $\{2,4,8,16\}$) or to an arithmetic expression (e.g. $\lambda x . x^{2} + 1$). We design our grammar to map these expressions over sets, so every well-formed tree generated by the grammar resolves to a specific set.
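As a quick plain-Python sketch (not LOTlib code; the names here are just for illustration), 'mapping an expression over a set' means applying a lambda to every element of a base range and keeping only the results that fall inside the domain:

# Plain-Python illustration: mapping arithmetic expressions over range(1, 100)
# and restricting to the domain yields the sets our trees will denote.
domain = set(range(1, 101))

def map_set(f, base=range(1, 101)):
    return set(f(n) for n in base) & domain

evens         = map_set(lambda n: 2 * n)       # {2, 4, 6, ..., 100}
powers_of_two = map_set(lambda n: 2 ** n)      # {2, 4, 8, 16, 32, 64}
squares_plus1 = map_set(lambda n: n ** 2 + 1)  # {2, 5, 10, 17, ...}

print sorted(powers_of_two)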

Given a set of input integers, a learner should be able to infer the tree that was most likely used to generate those integers.

Hypotheses

It's worth noting here that our grammar is probabilistic: each rule has an associated weight, and the probability of applying a rule is its weight divided by the total weight of all rules that expand the same nonterminal. Because $Expr \overset{2.0}{\longrightarrow} n$ has weight 2 and the rule $Expr \overset{1.0}{\longrightarrow} 2$ has weight 1, given an 'Expr' node the first rule is twice as likely to be chosen as the second.
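To make the weight-to-probability relationship concrete, here is a small sketch using just a few of the 'Expr' rules (the real normalization would include every rule with 'Expr' on the left-hand side):

# Sketch: rule weights become probabilities by normalizing within a nonterminal.
expr_rules = [('Expr -> n', 2.0), ('Expr -> 2', 1.0), ('Expr -> Expr + Expr', 0.5)]
total = sum(w for _, w in expr_rules)
for rule, w in expr_rules:
    print '%-20s p = %.3f' % (rule, w / total)
# 'Expr -> n' (weight 2.0) is twice as probable as 'Expr -> 2' (weight 1.0)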

Any sentence generated by this grammar can be part of a hypothesis, which associates posterior, prior, and likelihood probabilities with an expression.

Setting up a grammar

We would also like to consider sets of integers generated using set-theoretic operations. For example, we might have the input $D = \{2,4,5,8,10,15,16,20,25,30,32,...\}$. The concept that produced this is numbers that are either a power of two or a multiple of five: $D = \{n \mid \exists x . n = 2^{x}\} \cup \{n \mid \exists x . n = 5x \}$.

Expressions such as $2^{n} - 1$ or $3n$ are applied to some other set, for example the set produced by the rule $Range(1,100)$. When sets are evaluated, they ultimately bottom out in evaluations of $Range(Num, Num)$, possibly combined by set operations such as $\cup$ and $\cap$.

We generate an expression by starting from the grammar's 'Start' node and applying rules whose left-hand side matches a nonterminal node until every node is a terminal leaf. Using this grammar, we can generate expressions such as the following (a couple of these denotations are checked in plain Python below):

$ map(\lambda n. 2n, range(1,100)) \sim \{2,4,6,8,...\}\ \ \ \ [even\ numbers] \\ map(\lambda n. 8, range(1,100)) \sim \{8\}\ \ \ \ [eight] \\ map(\lambda n. 2^{n}, range(1,100)) \sim \{2,4,8,16,32,64\}\ \ \ \ [powers\ of\ two] \\ map(\lambda n. 2^{n}-1, range(1,100)) \sim \{1,3,7,15,31,63\}\ \ \ \ [powers\ of\ two,\ minus\ one] \\ map(\lambda n. 2n, range(1,100)) \cup map(\lambda n. 5n, range(1,100)) \sim \{2,4,5,6,8,...\}\ \ \ \ [even\ or\ multiple\ of\ five] \\ map(\lambda n. 8, range(1,100)) \cup map(\lambda n. 17, range(1,100)) \sim \{8,17\}\ \ \ \ [eight\ or\ seventeen] \\ map(\lambda n. 2^{n}, range(1,100)) \setminus map(\lambda n. 16, range(1,100)) \sim \{2,4,8,32,64\}\ \ \ \ [powers\ of\ two\ except\ 16] \\ Range(1,37)\ \ \ \ [range\ from\ 1\ to\ 37] $
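We can sanity-check a couple of these denotations with plain Python set arithmetic over the domain (again, this is not LOTlib code):

# Checking two of the example expressions by brute force over the domain 1..100.
domain = set(range(1, 101))
evens     = set(2 * n for n in range(1, 101)) & domain
mult_five = set(5 * n for n in range(1, 101)) & domain
pow_two   = set(2 ** n for n in range(1, 101)) & domain

print sorted(evens | mult_five)[:8]  # [2, 4, 5, 6, 8, 10, 12, 14] -- even or multiple of five
print sorted(pow_two - set([16]))    # [2, 4, 8, 32, 64] -- powers of two except 16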

Grammar

First Rule

$ Start \longrightarrow Set $

Sets

$ Set \overset{5.0}{\longrightarrow} Range(Expr, Expr) \\ Set \overset{6.0}{\longrightarrow} Range(1, 100) \\ Set \overset{0.5}{\longrightarrow} Set \cup Set \\ Set \overset{0.5}{\longrightarrow} Set \cap Set \\ Set \overset{0.5}{\longrightarrow} Set \setminus Set \\ Set \overset{1.0}{\longrightarrow} Map(Func, Set) \\ $

Mapping Expressions to Sets

$ Func \longrightarrow \lambda n . Expr(n) \\ Expr \overset{2.0}{\longrightarrow} n $

Expressions

$ Expr \overset{0.5}{\longrightarrow} Expr + Expr \\ Expr \overset{0.5}{\longrightarrow} Expr - Expr \\ Expr \overset{0.5}{\longrightarrow} Expr * Expr \\ Expr \overset{0.5}{\longrightarrow} Expr^{Expr} $

Expression Terminals

$ Expr \overset{1.0}{\longrightarrow} 0 \\ Expr \overset{1.0}{\longrightarrow} 1 \\ ... \\ Expr \overset{1.0}{\longrightarrow} 9 $


In [1]:
"""
Import statements

"""
%pylab inline
%reload_ext gvmagic
from pylab import *
from LOTlib import lot_iter, MHSampler, prior_sample
import LOTlib.Miscellaneous as misc
import LOTlib.Visualization as viz
from LOTlib.Examples.NumberGame.Model import *

"""
Setting up our grammar

"""
# Remove the grammar imported above from the NumberGame example so we can define our own
del grammar
grammar = Grammar()
grammar.add_rule('START', '', ['SET'], 1.)

# Sets
grammar.add_rule('SET', 'union_', ['SET', 'SET'], .5)
grammar.add_rule('SET', 'setdifference_', ['SET', 'SET'], .5)
grammar.add_rule('SET', 'intersection_', ['SET', 'SET'], .5)

# Range of numbers, e.g. [1,100] (numbers 1 through 100)
grammar.add_rule('SET', 'range_set_', ['EXPR', 'EXPR', 'bound=100'], 5.)
grammar.add_rule('SET', 'range_set_', ['1', '100', 'bound=100'], 6.)

# Mapping expressions over sets of numbers
grammar.add_rule('SET', 'mapset_', ['FUNC', 'SET'], 1.)
grammar.add_rule('FUNC', 'lambda', ['EXPR'], 1., bv_type='EXPR', bv_p=2.0)

# Expressions
grammar.add_rule('EXPR', 'plus_',  ['EXPR', 'EXPR'], .5)
grammar.add_rule('EXPR', 'minus_', ['EXPR', 'EXPR'], .5)
grammar.add_rule('EXPR', 'times_', ['EXPR', 'EXPR'], .5)
grammar.add_rule('EXPR', 'ipowf_', ['EXPR', 'EXPR'], .5)

# Terminals: the integer constants 0 through 9
for i in range(0, 10):
    grammar.add_rule('EXPR', str(i), None, 1.)

print 'Now we have a grammar!'


Populating the interactive namespace from numpy and matplotlib
Now we have a grammar!

In [2]:
%reload_ext gvmagic

for i in range(10):
    t = grammar.generate()
    print t.log_probability(), str(t)
    %dotstr t.dot_string()


-35.385576445 mapset_(lambda y1: 6, union_(setdifference_(range_set_(8, 0, bound=100), range_set_(times_(3, 3), 0, bound=100)), range_set_(5, 0, bound=100)))
-0.810930216216 range_set_(1, 100, bound=100)
-10.5116649071 mapset_(lambda y1: y1, range_set_(2, 7, bound=100))
-5.96306507259 range_set_(5, 1, bound=100)
-0.810930216216 range_set_(1, 100, bound=100)
-10.0698321548 intersection_(range_set_(6, 3, bound=100), range_set_(1, 100, bound=100))
-11.2048120876 mapset_(lambda y1: 7, range_set_(7, 8, bound=100))
-15.7327926349 union_(range_set_(times_(4, 6), 3, bound=100), range_set_(1, 100, bound=100))
-0.810930216216 range_set_(1, 100, bound=100)
-5.96306507259 range_set_(5, 9, bound=100)

*(Graphviz tree renderings from %dotstr omitted.)*

Inferring $p(h \mid X)$ to compute $p(y \in C \mid X)$

The Problem

On input $X$, a learner should be able to recognize that these are all powers of 2.

$$ X = \big\{2, 8, 16, 32, 64 \big\} $$

That is, the learner should infer that the concept used to generate these numbers is:

$$ C = \big\{ 2,4,8,16,32,64 \big\} $$

In other words, the integer powers of two that lie in the range $1$ to $100$, which our grammar expresses as:

$$ C_{h} = \big\{n \mid n \in map \big( \lambda x . 2^x, range(1,100) \big)\big\} $$

However, we can imagine other hypotheses that could have generated this same data. For example, perhaps the rule is 'powers of 2, except for the number 4':

$$ C_{h'} = C \setminus \{4\} $$

$X$ could also have been generated by an 'even numbers only' expression:

$$ C_{h''} = \big\{n \mid n \in map \big( \lambda x . 2x, range(1,100) \big)\big\} $$

or even by an expression with no constraints beyond the domain itself:

$$ C_{h'''} = \big\{n \mid n \in range(1,100) \big\} $$

This raises the question: how do we select the best hypotheses?

Bayesian Inference

The value we want to calculate is the posterior probability $p(h \mid X)$ of each hypothesis: "the chance that this hypothesis is what generated the data we're seeing". Accordingly, the hypothesis that most probably generated $X$ is the one with the highest posterior:

$$p(h \mid X) = \frac{p(X \mid h)\ p(h)}{\sum_{h'} p(X \mid h')\ p(h')}$$

We use two other quantities to compute this: the prior probability of the hypothesis $p(h)$, or "the chance that our hypothesis would be generated from our grammar", and the likelihood of the data given the hypothesis $p(X \mid h)$, or "the chance that, given this hypothesis, we would generate this data".
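To see how these pieces fit together, here is a toy sketch of Bayes' rule over a small hypothesis set. The prior and likelihood values are made up (the likelihoods anticipate the size principle discussed in the next section):

# Toy Bayes' rule: posterior is proportional to likelihood * prior, normalized over hypotheses.
hypotheses = {
    #                  p(h)   p(X | h), for 5 data points
    'powers_of_two': (0.05, (1.0 / 6) ** 5),
    'even_numbers':  (0.10, (1.0 / 50) ** 5),
    'all_numbers':   (0.20, (1.0 / 100) ** 5),
}
unnormalized = dict((h, prior * lik) for h, (prior, lik) in hypotheses.items())
Z = sum(unnormalized.values())
for h in hypotheses:
    print '%-15s p(h | X) = %.6f' % (h, unnormalized[h] / Z)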

Likelihood of Data Given Hypothesis     $p(X \mid h)$

Likelihood is important because it lets us prefer the hypothesis that best explains the data among multiple possible choices. E.g. the input data $X = \{4,8,32\}$ could be generated by either hypothesis, $n \longrightarrow 2^{n}$ or $n \longrightarrow 2n$. The first hypothesis ('powers of two') maps to the set $\mathcal{H} = \{2,4,8,16,32,64\}$; each datum has likelihood $1/6$ of being generated, since $|\mathcal{H}| = 6$. The second hypothesis ('even numbers') maps to $\mathcal{H} = \{2,4,6,8,10,...,98,100\}$, so each datum has likelihood $1/50$ of being generated.

As we can see, it is much more likely that $X = \{4,8,32\}$ was generated by the first hypothesis than by the second.

More formally, if each datum is drawn uniformly from the extension of $h$, the likelihood of the whole data set follows the size principle (ignoring noise for the moment; see the noise parameter $\alpha$ below):

$$p(X \mid h) = \prod_{i=1}^{N} p(x_i \mid h) = \left(\frac{1}{|h|}\right)^{N} \quad \text{if every } x_i \in h, \text{ and } 0 \text{ otherwise}$$

For $X = \{4,8,32\}$ this gives $(1/6)^{3}$ under 'powers of two' versus $(1/50)^{3}$ under 'even numbers'.
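A minimal sketch of this size-principle likelihood in plain Python (noise-free; the noise parameter $\alpha$ is discussed below):

# Size-principle likelihood: each datum is drawn uniformly from the hypothesis' extension.
def likelihood(data, extension):
    if any(x not in extension for x in data):
        return 0.0
    return (1.0 / len(extension)) ** len(data)

powers_of_two = set([2, 4, 8, 16, 32, 64])
even_numbers  = set(range(2, 101, 2))

X = [4, 8, 32]
print likelihood(X, powers_of_two)  # (1/6)^3  ~ 0.0046
print likelihood(X, even_numbers)   # (1/50)^3 =  8e-06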

Hypothesis Complexity / Prior Probability     $p(h)$

Prior probability is important as well, since it lets us prefer simpler hypotheses. If we consider two equivalent hypotheses, e.g. $2n = times(2,n)$ and $2n+1-1 = minus(plus(times(2,n),1),1)$, we prefer the first because it is more likely to be generated by the grammar: its tree uses fewer rules, so its prior is higher.
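As a rough sketch (with hypothetical per-rule probabilities, not the exact normalized probabilities of the grammar above), the prior of a tree under a probabilistic grammar is the product of the probabilities of the rules used to build it, so smaller trees get higher priors:

from math import log

# Hypothetical rule probabilities; in the real grammar these come from
# normalizing the rule weights within each nonterminal.
p_rule = {'plus_': 0.04, 'minus_': 0.04, 'times_': 0.04, 'terminal': 0.08, 'n': 0.16}

def log_prior(rules_used):
    return sum(log(p_rule[r]) for r in rules_used)

# 2*n            -> times_(2, n)
# (2*n + 1) - 1  -> minus_(plus_(times_(2, n), 1), 1)
print log_prior(['times_', 'terminal', 'n'])
print log_prior(['minus_', 'plus_', 'times_', 'terminal', 'n', 'terminal', 'terminal'])
# The smaller tree gets the higher (less negative) log prior.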

Noise Parameter     $\alpha$

Children are capable of learning abstract rules from sparse, noisy, exclusively positive data. For example, a child might be repeatedly told that $\{1,3,5,7,9,11,15,...\}$ are odd numbers, but never explicitly learn that 13 is an odd number. There might even be an error or two, where perhaps the child mistook a set of cardinality 3 to have 2 items, and was told the set size was odd.

In this case, the child's input data could map to the multiset $D = \{1,1,1,2,3,3,3,...,7,9,9,9,11,11,11,15,15,...\}$, while their hypothesis maps to $C = \{1,3,5,7,9,11,...,99\}$. The correct hypothesis corresponds to the rule most likely to have generated this data:

$$ n \longrightarrow 2n - 1 $$

The noise parameter $\alpha$ is what lets the learner tolerate the stray $2$: roughly speaking, each datum is assumed to come from the hypothesis' extension with probability $\alpha$ and to be noise with probability $1 - \alpha$, so a single inconsistent datum lowers the likelihood instead of forcing it to zero.
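One common way to formalize this (and roughly how number-game models are usually set up; the exact likelihood LOTlib's NumberGame example uses may differ in detail) is to mix the size-principle likelihood with uniform noise over the domain, weighted by $\alpha$: $p(x \mid h) = \alpha \cdot \frac{[x \in h]}{|h|} + (1-\alpha) \cdot \frac{1}{|D|}$. A small sketch:

from math import log

# Noisy likelihood sketch: with probability alpha a datum comes from the hypothesis'
# extension; with probability (1 - alpha) it is uniform noise over the domain.
# This illustrates the idea and is not necessarily LOTlib's exact formula.
def noisy_log_likelihood(data, extension, alpha=0.99, domain_size=100):
    ll = 0.0
    for x in data:
        p_from_h = (1.0 / len(extension)) if x in extension else 0.0
        ll += log(alpha * p_from_h + (1 - alpha) / domain_size)
    return ll

odds = set(range(1, 100, 2))
print noisy_log_likelihood([1, 3, 3, 5, 7, 9, 11], odds)  # all data consistent with 'odd'
print noisy_log_likelihood([1, 2, 3, 5, 7, 9, 11], odds)  # one noisy datum (2) lowers, but doesn't zero, the likelihood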


In [4]:
### Input data
data = [2, 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 16, 16, 16, 32, 32, 32, 32, 32, 32, 64, 64, 64]
# data = input("Please enter your input data X (list of integers): ")
assert isinstance(data, list), "That's not a list!"

### Range of numbers in the domain => [1, domain]
domain = 100

### Noise parameter
alpha = .99

### Number of samples to generate
num_iters = 5000

Sampling from the Hypothesis Space

The grammar above defines an infinite space of possible hypotheses, since rules such as $Expr \longrightarrow Expr * Expr$ can recurse without bound, as in $(Expr * Expr * Expr * Expr *\ .\ .\ .)$.

Since we cannot enumerate every hypothesis, we approximate the posterior instead: we draw a large but finite number of sample hypotheses, score each one against the data, and normalize over the samples we have seen. Below we try two ways of generating those samples: drawing them directly from the grammar's prior, and running a Metropolis-Hastings sampler.
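Once we have a bag of sampled hypotheses with unnormalized log posterior scores (log prior plus log likelihood), we can normalize over the sample in log space. Conceptually this is a log-sum-exp, something like the sketch below (the `normalizing_constant` helper used later may differ in detail):

from math import log, exp

# Sketch: normalize unnormalized log posterior scores over a finite sample of hypotheses.
# The scores below are made up for illustration.
def log_sum_exp(log_vals):
    m = max(log_vals)
    return m + log(sum(exp(v - m) for v in log_vals))

posterior_scores = [-3.2, -5.7, -4.1]          # hypothetical log(prior) + log(likelihood)
Z = log_sum_exp(posterior_scores)
print [exp(s - Z) for s in posterior_scores]   # normalized posteriors, summing to 1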

Sample Hypotheses Randomly

Our first approach is the simplest: sample hypotheses directly from the grammar's prior distribution, then score each sample against the input data.


In [5]:
# Set up sampler
h0 = make_h0(grammar=grammar, alpha=alpha)
prior_sampler = prior_sample(h0, data, num_iters)

# Iterate through the samples we've generated, scoring each one against the data
hypotheses = set()
for h in lot_iter(prior_sampler):
    # Update hypothesis posterior score relative to our input data
    h.compute_posterior(data)
    hypotheses.add(h)
    
# Estimate normalizing constant Z 
Z = normalizing_constant(hypotheses)

print 'all done!'


all done!

$p(y \in C \mid X)$ via the Top 10 Most Probable Hypotheses (highest $p(h \mid X)$)


In [6]:
sorted_hypotheses = sorted(hypotheses, key=lambda h: h.posterior_score, reverse=True)
print '\nGenerated %i hypotheses in total' % num_iters
print str(len(hypotheses)) + ' unique hypotheses\n'
print '='*80

for i in range(10):
    h = sorted_hypotheses[i]
    h_in_domain = [item for item in h() if item <= domain]
    
    # Create figure
    plt.figure(num=None, figsize=(8, 1.5), dpi=80, facecolor='w', edgecolor='k')
    plt.hist(h_in_domain, bins=domain, range=(1,domain))
    plt.title('Hypothesis: ' + str(h) +
              '\nPosterior: %.5f' % exp(h.posterior_score - Z) +
              '\nLog prior: %.3f' % h.prior +
              '\nLog likelihood: %.3f' % h.likelihood)


Generated 5000 hypotheses in total
1329 unique hypotheses

================================================================

Sample Hypotheses with the Metropolis-Hastings Algorithm

Our second approach uses the Metropolis-Hastings algorithm: starting from an initial hypothesis, the sampler repeatedly proposes a modified hypothesis tree and accepts or rejects the proposal based on its posterior score, so the samples concentrate in high-posterior regions of the hypothesis space.


In [29]:
# Set up sampler
mh_iters = 100000
mh_sampler = MHSampler(h0, data, mh_iters)

# Iterate through the samples we've generated, scoring each one against the data
hypotheses = set()
i = 0
for h in lot_iter(mh_sampler):

    viz.print_iters(i, mh_iters)
    i += 1

    # Update hypothesis posterior score relative to our input data
    h.compute_posterior(data)
    hypotheses.add(h)
    
# Estimate normalizing constant Z 
Z = normalizing_constant(hypotheses)

$p(y \in C \mid X)$ via the Top 10 Most Probable Hypotheses (highest $p(h \mid X)$)


In [28]:
sorted_hypotheses = sorted(hypotheses, key=lambda h: h.posterior_score, reverse=True)
print '\nGenerated %i hypotheses in total' % mh_iters
print str(len(hypotheses)) + ' unique hypotheses\n'
print '='*80
for i in range(10):
    h = sorted_hypotheses[i]
    h_in_domain = [item for item in h() if item <= domain]
    
    # Create figure
    figure(num=None, figsize=(8, 1.5), dpi=80, facecolor='w', edgecolor='k')
    hist(h_in_domain, bins=domain, range=(1,domain))
    title('Hypothesis: ' + str(h) +
          '\nPosterior: %.5f' % exp(h.posterior_score - Z) +
          '\nLog prior: %.3f' % h.prior +
          '\nLog likelihood: %.3f' % h.likelihood)


Generated 500000 hypotheses in total
1448 unique hypotheses

================================================================